refactor(config): replace hydra/omegaconf with typed pydantic+tyro by GrigoryEvko · Pull Request #21 · FusionBrainLab/gigaevo-core

GrigoryEvko · 2026-05-21T03:23:07Z

TL;DR

Replaces the Hydra/OmegaConf YAML configuration layer with typed Pydantic v2 schemas driven by a tyro CLI entry point. The runtime no longer parses YAML or resolves string interpolation; every experiment is a regular Python module that builds an ExperimentConfig and hands it to a renamed entry point at run.py.

Beyond the cutover itself, the audit surfaced and fixed 29 pre-existing bugs and latent defects — including a leaked API key in run-ID hashes, a race condition in the round-robin island selector, two O(N²) hot paths now 14–23× and 100–3200× faster, and a class of assert-based validators that silently corrupted output under python -O.

What this delivers

Typed config end-to-end. Pydantic v2 schemas in gigaevo/config/schemas/ cover every subsystem. Mistyped flags fail at parse time instead of mid-experiment.
Self-documenting CLI. python run.py <experiment> --help lists every overridable field with its description; Field(description=...) is present on every user-facing field.
Discoverable composition. Presets in gigaevo/config/{algorithm,engine,llm,pipeline,problem,runner}_presets.py compose into experiments via plain function calls — IDE jump-to-definition, autocomplete, and type-checker support all work.
Reproducibility artefact. Each run writes output_dir/{experiment_id}/config.json, where experiment_id = sha256(model_dump_json())[:12] — stable across reruns of the same config, distinct under any change.
Subprocess-isolated sweeps. gigaevo/sweep.py runs a parameter grid as N independent subprocesses, immune to GIL / global-state / CUDA-context leaks across runs.
Smaller dependency surface. Drops hydra-core + omegaconf; adds the lighter tyro. Faster import, fewer transitive deps.
Cleaner test isolation. No global resolver registry to leak between tests; no register_resolvers autouse needed.
No more MISSING sentinels. Pydantic validation refuses to construct a half-filled config, so subsystems never see partial state.
9 reference experiments under experiments/ demonstrate the patterns: env-driven secrets via default_factory, discriminated-union variants selected by kind=..., preset composition, override syntax.

Performance wins

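Two long-standing O(N²) hot paths were reworked, each with an explicit micro-benchmark. EvolutionaryStatisticsCollector._process now buckets programs by iteration in one O(N) pass instead of rescanning the population for every program, and _compute_pareto_front extracts fitness values once per program instead of twice per pair inside the dominance loop. A third, smaller win merges two regex walks into one: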

| Hot path | Where | Speedup |
| --- | --- | --- |
| `_compute_pareto_front` | `gigaevo/evolution/strategies/migrant_selectors.py` | 14.9× at N=50, 22.9× at N=200, 23.1× at N=500 |
| `EvolutionaryStatisticsCollector._process` | `gigaevo/programs/stages/collector.py` | 114× at N=200×M=5, 570× at N=1000×M=10, 3205× at N=5000×M=50 |
| `ChainFeatureExtractor.extract` regex pass | `gigaevo/evolution/scheduling/feature_extractor.py` | 2× (single-pass merge of two `re.finditer` walks) |

Reliability — bugs and latent defects fixed

The cutover audit went deep across the whole repo. 29 distinct issues were found and fixed:

Security / credentials

ChatOpenAIConfig.api_key leaked into reproducibility artefacts. The key was being serialised into output_dir/<experiment_id>/config.json, and worse, into experiment_id itself (sha256 of model_dump_json()), so run IDs depended on credential rotation. Pydantic exclude=True keeps the key in-memory only.
Real OpenRouter API key committed in problems/chains/musique/shared_config.py and problems/chains/musique_retrieval/shared_config.py. Replaced with os.environ.get("OPENROUTER_API_KEY", ""). (The leaked key remains in git history and should be rotated by its owner — third-party key, not anyone on this team.)
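The effect of keeping the key out of the serialised config can be sketched with the standard library alone. This is a minimal illustration, not the PR's code: `LLMConfig`, `dump_json`, and the `exclude` metadata flag are hypothetical stand-ins for the Pydantic model, `model_dump_json()`, and `exclude=True`.

```python
import hashlib
import json
from dataclasses import dataclass, field, fields


@dataclass(frozen=True)
class LLMConfig:
    """Hypothetical stand-in for a typed LLM config schema."""

    model: str = "example-model"
    temperature: float = 0.7
    # The metadata flag plays the role of Pydantic's exclude=True here.
    api_key: str = field(default="", repr=False, metadata={"exclude": True})


def dump_json(cfg: LLMConfig) -> str:
    """Serialise every field except those flagged as excluded."""
    payload = {
        f.name: getattr(cfg, f.name)
        for f in fields(cfg)
        if not f.metadata.get("exclude")
    }
    return json.dumps(payload, sort_keys=True)


def experiment_id(cfg: LLMConfig) -> str:
    """First 12 hex characters of the sha256 of the serialised config."""
    return hashlib.sha256(dump_json(cfg).encode("utf-8")).hexdigest()[:12]
```

With this shape, rotating the credential leaves the run ID unchanged, while changing any real hyperparameter produces a different ID.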

Correctness

RecordCardExtended.__init__ shadowed the dataclass init and never applied field(default_factory=...) defaults. Reading card.usage, card.keywords, card.evolution_statistics, card.works_with, or card.links raised AttributeError; dataclasses.asdict(card) exploded. change_motivation was mandatory in body but missing from required_fields, so import_idea_extended(is_forced=False) was dead on arrival. Lock-in tests added.
RoundRobinIslandSelector._idx race condition. Concurrent threads double-skipped or repeated islands. Added threading.Lock; 8-thread × 25-call uniform-histogram test asserts exact balance.
RidgePredictor.predict held the model lock across CPU-bound extract. Snapshot under lock, release, extract + predict on captured locals. Concurrency lock-in test added.
DagRunner.stop cancelled _metrics_collector_task without await → "Task was destroyed but it is pending" warnings + writer-ref retained past storage.close(). Now awaits with suppress(CancelledError).
DagRunner._launch fire-and-forget cancel on failed transition tasks → tasks lingered pending until GC. Routed through _cancel_task (which awaits with timeout).
cfg.problem.build() called twice in build_object_graph → double metrics.yaml reload per graph build. Threaded the already-built ProblemContext.
sweep._run_one aborted the entire pool on worker-spawn OSError (E2BIG/EMFILE/ENOMEM) — one bad spawn dropped every queued sibling run. Now logs and returns 1 to preserve "best-effort across all runs" semantics.
_dump_resolved_config orphaned .config.*.tmp files on write failure. Wrapped in try/finally with idempotent unlink.
chain_runner._run_chain_on_dataset_stepwise:429 was dropping the sample argument to _resolve_reference, so $sample.X references silently resolved to "". Latent landmine — the only consumer (musique_retrieval) routes through the non-stepwise variant today, but any future stepwise consumer would have been silently broken. Threaded dataset[i] through. 11 unit tests; reverted patch confirms the regression.
problems/prompts/utils.py:158 client.call_logs[0] dropped retry call logs and would IndexError on empty. New _aggregate_call_logs helper sums across all attempts; returns a zero CallLog on empty input. 3 unit tests.
remove_boxed in 3 problem helpers used bare assert s[:len(left)] == left and assert s[-1] == "}". Under regular Python they raised AssertionError on \boxed{42 (truncated) or \boxed{42}xyz (trailing garbage) and crashed the entire extract_answer loop in validate.py. Under python -O the assertions were stripped, yielding a corrupted slice. Replaced with explicit return None. 21 parametrized tests.
problems/prompts/ifbench/validate.py:37 mutated source DataFrame in place. to_dict(orient="records") shares list-cell references with the source DataFrame; the in-place rewrite corrupted subsequent iterations. Replaced with a local binding.
problems/prompts/jigsaw_community_rules/validate.py:46 returned None fitness on degenerate input; the downstream consumer in strategies/utils.py:79 does -value, which crashes on None. Returns 0.0 now.
3 chain validators had -> dict annotations on (metrics, failures) tuple returns — annotation lied about the contract. Fixed hover/static, hotpotqa/static_ra, hotpotqa/static_a.
gigaevo/__init__.py had a dead pydantic.config.configure(compile="jit") call — that API has never existed in pydantic 2.x. The surrounding try/except Exception: pass silently swallowed AttributeError on every package import. Removed.
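The cancel-then-await pattern behind the DagRunner.stop fix can be sketched as follows. The helper name `cancel_and_wait` and the `collector` coroutine are illustrative assumptions, not the repository's actual `_cancel_task`, which per the description also applies a timeout.

```python
import asyncio
from contextlib import suppress


async def cancel_and_wait(task: asyncio.Task) -> None:
    """Request cancellation, then wait until the task has actually finished."""
    task.cancel()
    # Awaiting lets the task unwind (finally blocks, closing writers) before
    # the caller moves on to tearing down shared resources.
    with suppress(asyncio.CancelledError):
        await task


async def demo() -> bool:
    async def collector() -> None:
        while True:  # stand-in for a periodic metrics collector
            await asyncio.sleep(0.01)

    task = asyncio.create_task(collector())
    await asyncio.sleep(0)  # give the task a chance to start
    await cancel_and_wait(task)
    return task.done() and task.cancelled()
```

Calling `task.cancel()` alone only schedules the cancellation; without the `await`, the task can still be pending when the event loop closes, which is what triggers the "Task was destroyed but it is pending" warning.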

Latent bugs (would have bitten under specific conditions)

tools/lineage.py:226 sort key returned float | None → TypeError if any program had None fitness. Use -math.inf substitute.
tools/lineage.py::_walk_lineage no cycle guard → infinite loop on corrupted parent chain (A→B→A or self-loop). Visited-set guard + 5 regression tests.
tools/redis2pd.py non-atomic df.to_csv → corrupt CSVs on concurrent runs or interrupts. Added _atomic_write_csv (tempfile + os.replace).
3 Redis clients leaked in tools/{utils,fitness_vs_time,throughput_plot}.py — the throughput plotter scaled the leak with fan-out. Wrapped in try/finally.
Deprecated asyncio.get_event_loop() at 11 callsites across test_bandit.py, test_coevolution_pipeline.py, test_redis_storage.py, test_wrapper_enhanced.py. Bites whenever any prior event loop in the thread has been closed (exactly what pytest-asyncio does between tests); Python 3.12 emits a DeprecationWarning when no current event loop is set, and Python 3.14 raises RuntimeError in that case. Migrated to asyncio.run() / get_running_loop().
stage_timeout accepted on 6 builder schemas whose runtime constructor ignored it — silent user surprise. Moved to the two builders that actually consume it; lock-in tests reject the field on the others via extra="forbid".
DEFAULT_BINNING_TYPE: Final[str] mistyped against BinningType = Literal["linear"] → 5 invariance errors across algorithm_presets.py. Retyped.
LoggingConfig.build_writer annotated -> GenericLogger but returned CompositeLogger (siblings under LogWriter, not subclasses). The existing test already asserts CompositeLogger; annotation was the lie. Corrected.
experiments/prompt_coevolution.py main_redis_db=0 was a literal — overriding --redis.db N on tyro broke the coevolved-prompt fetcher (main wrote to DB N while the fetcher stayed at 0). Threaded redis.db through both sides.
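The tempfile + os.replace approach named for tools/redis2pd.py can be sketched with the standard library. This is an assumption-laden illustration: the real `_atomic_write_csv` writes a DataFrame, whereas the hypothetical `atomic_write_csv` below uses the `csv` module so the example stays dependency-free.

```python
import contextlib
import csv
import os
import tempfile
from pathlib import Path


def atomic_write_csv(path: Path, header: list[str], rows: list[list[object]]) -> None:
    """Write a CSV so readers see either the old file or the complete new one."""
    path = Path(path)
    # The temp file lives in the target directory so os.replace stays on one
    # filesystem, which is what makes the rename atomic.
    fd, tmp_name = tempfile.mkstemp(
        dir=path.parent, prefix=f".{path.name}.", suffix=".tmp"
    )
    try:
        with os.fdopen(fd, "w", newline="", encoding="utf-8") as fh:
            writer = csv.writer(fh)
            writer.writerow(header)
            writer.writerows(rows)
        os.replace(tmp_name, path)
    finally:
        # Idempotent cleanup: after a successful replace the temp name is gone.
        with contextlib.suppress(FileNotFoundError):
            os.unlink(tmp_name)
```

The same try/finally shape also covers the orphaned `.config.*.tmp` case: an interrupted write leaves no temp file behind and never exposes a half-written target.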

Why now

The Hydra layer was leaking OmegaConf semantics into the runtime: MISSING sentinels reached object construction, ${ref:X} resolution ran lazily and produced unhelpful tracebacks, the global resolver registry made test isolation awkward, and YAML interpolation was being asked to do work that wanted real Python expressions. A typed config model removes that whole class of problem.

What changes for users

Entry point rename

- python -m hydra_main +experiment=steady_state ...
+ python run.py experiments/steady_state.py [overrides ...]

YAML → Python experiment

A YAML experiment becomes a Python module exposing a single experiment() function that returns ExperimentConfig. The 9 reference experiments in experiments/ show the patterns.
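One way an entry point could load such a module from a file path is sketched below. This is a hedged illustration rather than the contents of run.py: `load_module_from_path` and `load_experiment` are hypothetical names, and the only contract taken from the description is that the module exposes an `experiment()` function.

```python
import importlib.util
from pathlib import Path
from types import ModuleType
from typing import Any


def load_module_from_path(path: Path) -> ModuleType:
    """Import a Python file by path without requiring it to be on sys.path."""
    spec = importlib.util.spec_from_file_location(path.stem, path)
    # Both the spec and its loader are Optional; fail with a clear message.
    if spec is None or spec.loader is None:
        raise ImportError(f"cannot load experiment module from {path}")
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module


def load_experiment(path: Path) -> Any:
    """Call the module-level experiment() factory and return its config."""
    module = load_module_from_path(Path(path))
    factory = getattr(module, "experiment", None)
    if not callable(factory):
        raise AttributeError(f"{path} does not define an experiment() function")
    return factory()
```

Because the experiment is ordinary Python, the returned object can then be validated and overridden by the CLI layer before the run starts.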

Overrides

Use tyro's dotted-path syntax instead of Hydra's +key=value:

python run.py experiments/steady_state.py --redis.db 7 --engine.max-generations 50

python run.py experiments/<file>.py --help lists every overridable field with its description.

Sweeps

gigaevo/sweep.py runs a parameter grid as N independent subprocesses, isolating GIL / global-state issues. Each run is invoked exactly as a normal run.py invocation; sweep definitions are Python dicts.
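The grid-to-subprocess idea can be sketched like this. It is a simplified assumption of how a sweep might be expanded, not the code in gigaevo/sweep.py: the function names are hypothetical, and runs execute sequentially here rather than through a worker pool.

```python
import itertools
import subprocess
import sys


def grid_to_overrides(grid: dict[str, list[object]]) -> list[list[str]]:
    """Expand a dict of option -> values into one CLI override list per run."""
    keys = list(grid)
    return [
        [arg for key, value in zip(keys, combo) for arg in (f"--{key}", str(value))]
        for combo in itertools.product(*(grid[k] for k in keys))
    ]


def run_sweep(base_cmd: list[str], grid: dict[str, list[object]]) -> list[int]:
    """Run every grid point in its own process and collect the exit codes."""
    codes: list[int] = []
    for overrides in grid_to_overrides(grid):
        try:
            proc = subprocess.run([*base_cmd, *overrides], check=False)
            codes.append(proc.returncode)
        except OSError as exc:
            # A failed spawn is recorded as exit code 1; sibling runs continue.
            print(f"spawn failed for {overrides}: {exc}", file=sys.stderr)
            codes.append(1)
    return codes
```

A grid such as `{"redis.db": [0, 1], "engine.max-generations": [50]}` expands to two invocations, each carrying the same dotted-path overrides a manual run would use.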

Schema surface

| Module | Covers |
| --- | --- |
| `schemas/experiment.py` | `ExperimentConfig` root, `experiment_id` hash, cross-field validators |
| `schemas/algorithm.py` | island topologies, MAP-Elites, single/multi-island, discriminated union |
| `schemas/engine.py` | steady-state, generational, bus-backed; `BusedEngineConfig` |
| `schemas/pipeline.py` | DAG builder variants (`default`, `auto`, `context`, `optuna_opt`, `cma_opt`, `algotune_speed`, `structural_metrics`, `problem_specific`) |
| `schemas/llm.py` | `ChatOpenAIConfig` / bandit / heterogeneous router discriminated union |
| `schemas/redis.py` + `schemas/migration_bus.py` | dataplane and migration-bus connection settings |
| `schemas/problem.py`, `schemas/prompt.py`, `schemas/logging.py`, `schemas/scheduling.py`, `schemas/runner.py` | remaining subsystem configs |

Field(description=...) is present on every user-facing field so --help is self-documenting.

Test plan

Conflict map with open PRs

This branch deletes the YAML tree and reshapes the config surface — that overlaps several open PRs at file-level only; the intent is orthogonal in every case:

| PR | Overlap | Resolution |
| --- | --- | --- |
| #2, #3, #4, #5, #6, #7, #8 | none (fix-only PRs in disjoint modules) | clean merge in either order |
| #10 (sanitize) | `gigaevo/config/helpers.py` was reshaped here; `gigaevo/utils/text_sanitize.py` is unchanged in this branch | take both sides; helpers.py reshape supersedes pre-cutover lines |
| #11 (xdist) | `pytest.ini`, `pyproject.toml` | take #11's pytest config + this branch's tyro dep |
| #12, #13 | LLM module only; no config-layer overlap | clean |
| #14 (loky-executor) | `gigaevo/entrypoint/constants.py` | take #14's constants reshape |
| #15 (error context) | `gigaevo/runner/dag_runner.py` unchanged here | clean |
| #16 (pipeline hygiene) | `gigaevo/config/helpers.py`, `gigaevo/entrypoint/default_pipelines.py` | both reshape the same modules; the second-merged PR rebases on the first |
| #17 (aiohttp) | `pyproject.toml`, `gigaevo/llm/models.py`, `gigaevo/infra/*`; config layer untouched | clean |
| #19 (asyncio-deprecation) | this branch independently migrated `asyncio.get_event_loop()` callers; #19 superset on `main` is preferred | rebase #19 onto post-cutover main; identical sites already on this branch can be dropped |
| #20 (dataplane-foundation) | deletes a different set of files (Redis substrate); doesn't touch `gigaevo/config/schemas/` or `experiments/` | clean at the config surface; engine wiring rebase needed |

No PR is blocked by this branch; merge order is reviewer's preference.

Out of scope / follow-ups

tests/test_tools/test_manifest.py references a tools.experiment.manifest module that has never existed in the repo (its production module was never committed); pre-existing collection error. Not addressed here.
A handful of # type: ignore[misc] comments on MagicMock.__class__ rebinding in tests — documented pattern, won't repay refactoring.
tools/status.py / tools/fitness_vs_time.py redis-py Awaitable type stubs are a known false-positive class; documented via typing.cast, no runtime effect.

…rns AnyCard

Core change: normalize_memory_card returns MemoryCard | ProgramCard (Pydantic models) instead of dict[str, Any]. All internal code uses attribute access (card.description) not dict access (card.get("description")).

Production code changes:
- card_conversion.py: normalize_memory_card returns AnyCard; card_to_concept_content, build_entity_meta, format_search_results, is_program_card all accept AnyCard
- memory.py: self.memory_cards is dict[str, AnyCard]; _persist_index serializes via model_dump(); _synthesize_results uses model attribute access; save_card accepts dict | AnyCard at boundary
- memory_write_example.py: load_memory_cards normalizes all output to AnyCard
- models.py: ProgramCard gains keywords, strategy, links fields; validate_assignment=True for mutability

Boundary pattern:
- External input (JSON, API responses, user dicts) → normalize_memory_card → AnyCard
- Internal operations → attribute access (card.field)
- Serialization (JSON, API) → card.model_dump() at the boundary
- card_update_dedup.py stays dict-based (LLM output parsing); callers pass model_dump() when crossing the boundary

813 tests pass, ruff check + format clean.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

refactor: ideas_tracker cleanup — loguru + sys.path removal

refactor: dict → Pydantic — normalize_memory_card returns AnyCard

Automatically generated by python-semantic-release

Create gigaevo/memory/__init__.py with curated public API:
- AmemGamMemory, MemoryCard, ProgramCard, AnyCard, ConnectedIdea
- normalize_memory_card, GigaEvoMemoryBase
- LocalMemorySnapshot, MemoryCardExplanation, Strategy

Update gigaevo/memory/shared_memory/__init__.py with same exports. Users can now import from `gigaevo.memory` instead of deep paths:

from gigaevo.memory import AmemGamMemory, MemoryCard

5 tests verify: __all__ completeness, package imports, subpackage imports, normalize roundtrip, AmemGamMemory construction.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- Test gap: add LocalMemorySnapshot and Strategy to test_import_from_package_root (previously 2/10 exports untested for importability)
- Circular import fragility: change `from gigaevo.memory import config` to `import gigaevo.memory.config as config` in 3 files, which avoids relying on partial parent-package init during the import chain

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

refactor(memory): public API exports

Move all memory-related test files from tests/ root into tests/memory/ subdirectory. Zero test loss: 483 tests before, 483 tests after.

Files moved:
- test_amem_gam_memory.py (67 tests)
- test_card_update_dedup_extended.py (75 tests)
- test_memory_api_search.py (21 tests)
- test_memory_card_update_dedup.py (6 tests)
- test_memory_contracts.py (21 tests)
- test_memory_cycle5.py (17 tests)
- test_memory_deeper.py (21 tests)
- test_memory_e2e_scenarios.py (21 tests)
- test_memory_engine_interaction.py (10 tests)
- test_memory_full_agentic.py (15 tests)
- test_memory_integration.py (26 tests)
- test_memory_known_bugs.py (20 tests)
- test_memory_models.py (16 tests)
- test_memory_operator_integration.py (14 tests)
- test_memory_public_api.py (5 tests)
- test_memory_with_fake_agentic.py (24 tests)
- test_memory_write_example_extended.py (22 tests)
- test_memory_write_program_cards.py (3 tests)
- test_normalize_memory_card.py (66 tests)
- test_pydantic_cards.py (13 tests)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…tion refactor(memory): consolidate test files into tests/memory/

Add `from __future__ import annotations` to all 41 memory module files that were missing it. Remove duplicate `_safe_get` from a_mem_memory_creation.py (now imports from utils.py). Auto-fix 4 UP037 violations (unnecessary quoted type annotations now that future annotations are active). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Was `-> dict[str, Any]` but actually returns `AnyCard` (Pydantic model). Found by chaos-hacker review — prevents TypeError trap for future callers. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

refactor(memory): type quality improvements

Replaced all print() calls with loguru logger in 6 files:
- A_mem/agentic_memory/memory_system.py: 4 prints → logger
- A_mem/agent/agent_class.py: 2 prints → logger
- GAM_root/gam/agents/research_agent.py: 38 prints → logger
- GAM_root/gam/retriever/index_retriever.py: 2 prints → logger
- GAM_root/gam/schemas/page.py: 2 prints → logger
- GAM_root/gam/schemas/memory.py: 2 prints → logger

Also removed old-style `logger = logging.getLogger(__name__)` in memory_system.py (replaced by loguru import).

509 tests pass, ruff clean.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

refactor: replace 50 print() with loguru in A_mem + GAM_root

exp: hover/no-deep-retrieval — ablation of retrieve_deep (k=10)

Rename map:
- test_normalize_memory_card → test_card_normalization
- test_memory_card_update_dedup → test_card_dedup
- test_card_update_dedup_extended → test_card_dedup_edge_cases
- test_amem_gam_memory → test_memory_backend
- test_memory_deeper → test_memory_backend_internal
- test_memory_full_agentic → test_memory_backend_agentic
- test_memory_with_fake_agentic → test_memory_backend_fakes
- test_memory_cycle5 → test_api_sync
- test_memory_operator_integration → test_mutation_operator
- test_memory_engine_interaction → test_engine_integration
- test_memory_write_example_extended → test_write_pipeline
- test_memory_write_program_cards → test_write_programs
- test_memory_e2e_scenarios → test_scenarios
- test_memory_integration → test_roundtrip
- test_memory_known_bugs → test_edge_cases
- test_concept_api_client → test_api_client (moved to tests/memory/)
- test_openai_inference → test_llm_inference (moved to tests/memory/)
- test_data_components, test_runtime_config (moved to tests/memory/)

666 tests pass.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

refactor: rename memory test files — descriptive names

Every suppression replaced with proper typing:
- Stage base: ClassVar[type[StageIO]] instead of unbound TypeVars
- json.py: single dumps/loads definitions with types.ModuleType backend
- LLM agents: TypedDict fields widened to accept None (truthful initial state)
- Redis coevolution: _get_redis() returns AsyncRedis instead of object
- DAG/engine/trackers: invariant assertions replacing silent suppression
- analyzer.py: fixed wrong return type (dict → IncomingIdeas)

4468 tests pass, lint clean.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

refactor: remove all 27 type: ignore comments

…p Dynamic Chains New problem variant: chains/hover/full7_no_deep (7-step max, standard retrieval only). Design approved by Reviewer-2. Two-phase protocol: Phase A builds memory bank, Phase B tests memory-augmented vs standard mutation. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…tr]) RecordCardExtended.aliases is list[dict[str, dict[str, str|list[str]]]] but MemoryCard.aliases expects list[str]. The _to_list() helper passed dicts through unchanged, causing Pydantic validation crash at the memory write pipeline step after ideas_tracker completes. Added _flatten_aliases() that extracts description strings from the nested dict format while preserving plain string aliases. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

RecordCardExtended.aliases is list[dict] (version history with nested {experiment_id: {description, programs, explanations}}). The Pydantic migration (846299e) incorrectly typed MemoryCard.aliases as list[str], causing a validation crash when memory_write_pipeline passes ideas_tracker output through normalize_memory_card. Root cause: Pydantic migration assumed aliases are simple strings, but Petr's original design uses them as structured version history. Fix the type at the model level instead of adding a flattening adapter. Reverts the _flatten_aliases band-aid from the previous commit. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…nLab#2, PR #161)

Added tests that verify normalize_memory_card and load_memory_cards handle ideas_tracker's structured alias format (list[dict] version history) without crashing. This test would have caught the Pydantic type mismatch that crashed the memory write pipeline (aliases: list[str] → list[dict]).

Tests added:
- test_aliases_with_ideas_tracker_dict_format (test_normalize_memory_card.py)
- test_aliases_mixed_types (test_normalize_memory_card.py)
- test_ideas_tracker_dict_aliases_preserved (test_memory_write_example_extended.py)

All 91 tests pass.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- Redis WATCH deprecation: use pipeline context in archive_storage.py
- AsyncMock coroutine warnings: set storage.snapshot = MagicMock() (bump() is sync)
- ast.Str deprecation: use ast.Constant only (Python 3.14 compat)
- Optuna ExperimentalWarning: suppress around TPESampler/PedAnovaImportanceEvaluator
- Unclosed file handles: pathlib.Path.read_text() in test_scheduling.py
- matplotlib tight_layout: layout="tight" on subplots() in comparison.py
- Island __len__ RuntimeWarning: suppress in intentional error test

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

The previous literal token was a real, live OpenRouter credential committed to the source tree. Switch both musique and musique_retrieval shared configs to read OPENROUTER_API_KEY from the environment so the value is no longer redistributed with the repo. The committed key must still be revoked and rotated.

…cstrings Rewrites the lineage-race and ingestion-atomicity class docstrings to describe the present-tense contract being exercised rather than narrating prior root-cause investigations or pinning to brittle source-file locations.

pydantic 2.x has no module-level ``configure`` API; the call has been raising ``AttributeError`` since the rename and the surrounding try/except swallowed it on every import. Removes the block, the import, and the silenced exception path.

… arg The Top-N path called list.sort() with a key returning float | None which raises TypeError if any program has None fitness. The previous filter suppressed that in practice, but a stray None would crash mid-sort. The fallback now substitutes -inf so None values sink to the end deterministically. _walk_lineage accepted a metric argument it never consulted; the chain walk depends only on parent edges. Drop it from the signature and call sites (including the regression tests).
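The None-safe sort key described above can be sketched in a few lines. The function name `fitness_sort_key` and the tuple layout of `programs` are illustrative assumptions; only the -inf substitution comes from the commit message.

```python
import math
from typing import Optional


def fitness_sort_key(fitness: Optional[float]) -> float:
    """Map a missing fitness to -inf so it never reaches a float/None comparison."""
    return -math.inf if fitness is None else fitness


# Hypothetical (name, fitness) pairs; one program has no fitness recorded.
programs = [("a", 0.4), ("b", None), ("c", 0.9)]

# Best first; the None entry sinks to the end deterministically.
top = sorted(programs, key=lambda p: fitness_sort_key(p[1]), reverse=True)
```

Sorting the raw values would raise TypeError as soon as a None met a float, and only on inputs that happen to contain one, which is why the defect stayed latent.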

…nce check

_compute_pareto_front extracted fitness values inside the inner O(N^2) loop, re-running extract_fitness_values twice per pair. Pre-extract once per program then iterate over the cached vectors, matching the pattern already used by ParetoFrontArchiveRemover.order_candidates.

Micro-benchmark with 3 fitness keys (N candidates):
- N=50: 9.53 ms -> 0.64 ms (14.9x)
- N=200: 109.32 ms -> 4.77 ms (22.9x)
- N=500: 365.29 ms -> 15.80 ms (23.1x)

tests/evolution/test_migrant_selectors.py: 10 passed.
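The pre-extraction idea can be sketched as below. This is a generic maximisation-only illustration under assumed names (`pareto_front`, `dominates`, an `extract` callable), not the repository's `_compute_pareto_front`.

```python
from typing import Callable, Sequence, TypeVar

P = TypeVar("P")


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """a dominates b if it is no worse on every key and better on at least one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(
    programs: Sequence[P], extract: Callable[[P], Sequence[float]]
) -> list[P]:
    """Return the non-dominated programs, extracting each fitness vector once."""
    # One extraction per program, hoisted out of the pairwise loop.
    vectors = [tuple(extract(p)) for p in programs]
    front = []
    for i, candidate in enumerate(vectors):
        dominated = any(
            dominates(other, candidate)
            for j, other in enumerate(vectors)
            if j != i
        )
        if not dominated:
            front.append(programs[i])
    return front
```

The pairwise dominance comparison is still quadratic in N; what disappears is the repeated extraction inside it, which is consistent with the speedup levelling off around 23× rather than growing with N.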

spec_from_file_location returns ModuleSpec | None and its loader is Optional. The previous direct .loader.exec_module() chain would raise AttributeError instead of a useful message if either was None. The assert documents the invariant and gives pyright the narrowing it needs.

…h threading.Lock The round-robin index was advanced under no synchronization. Concurrent callers from multiple OS threads could read the same value, write the same successor, and either repeat or skip an island. Wrap the RMW in a threading.Lock and pin the local index for the return. Covered by a new test that drives select_island from eight threads, each with its own event loop, and asserts a perfectly uniform island histogram across 200 advances.
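A minimal sketch of the locked read-modify-write, with a threaded histogram check in the spirit of the new test. The class and function names are hypothetical, and the selector is synchronous here, whereas the test described above drives an async select_island from per-thread event loops.

```python
import threading
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


class RoundRobinSelector:
    """Hands out islands in strict rotation, safe to call from many threads."""

    def __init__(self, islands: list[str]) -> None:
        self._islands = list(islands)
        self._idx = 0
        self._lock = threading.Lock()

    def select(self) -> str:
        with self._lock:
            idx = self._idx  # pin the index we are about to return
            self._idx = (idx + 1) % len(self._islands)
        return self._islands[idx]


def histogram(n_threads: int = 8, calls_per_thread: int = 25) -> Counter:
    """Drive the selector from several threads and count picks per island."""
    selector = RoundRobinSelector(["i0", "i1", "i2", "i3"])

    def worker(_: int) -> list[str]:
        return [selector.select() for _ in range(calls_per_thread)]

    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        picks = [p for chunk in pool.map(worker, range(n_threads)) for p in chunk]
    return Counter(picks)
```

Because every advance happens under the lock, 200 calls over 4 islands land exactly 50 times on each island regardless of thread interleaving.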

…action RidgePredictor.predict previously held the model lock across the extractor.extract call and the sklearn predict call. Extraction is a pure, potentially expensive operation that does not touch any of the predictor's mutable state, so serializing concurrent predictions through it is wasted contention. Snapshot the (model, feature_keys) pair under the lock, then release it before extracting features and invoking predict on the captured local references. The sklearn model is immutable after fit, so the captured reference remains valid for the duration of the call. The no-model fallback behaviour is preserved exactly. A new probe test acquires the predictor lock non-blocking from inside a custom extractor and asserts it is free on every concurrent call, so a regression that re-introduces lock-held extraction would surface as a False in the recorded lock-state list.
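The snapshot-under-lock pattern and the lock probe can be sketched together. `SnapshotPredictor`, its dict-of-weights "model", and `probing_extract` are simplified assumptions standing in for RidgePredictor, the sklearn model, and the test's custom extractor.

```python
import threading
from typing import Callable, Optional


class SnapshotPredictor:
    """Holds the lock only long enough to copy a reference to the model."""

    def __init__(self, extract: Callable[[str], dict[str, float]]) -> None:
        self._lock = threading.Lock()
        self._extract = extract
        self._weights: Optional[dict[str, float]] = None

    def fit(self, weights: dict[str, float]) -> None:
        with self._lock:
            # Swap in a fresh dict so earlier snapshots are never mutated.
            self._weights = dict(weights)

    def predict(self, program: str, default: float = 0.0) -> float:
        with self._lock:
            weights = self._weights  # snapshot under the lock
        if weights is None:
            return default  # no-model fallback
        features = self._extract(program)  # CPU-bound work, lock released
        return sum(w * features.get(k, 0.0) for k, w in weights.items())


lock_was_free: list[bool] = []


def probing_extract(program: str) -> dict[str, float]:
    """Record whether the predictor lock is free while extraction runs."""
    acquired = predictor._lock.acquire(blocking=False)
    lock_was_free.append(acquired)
    if acquired:
        predictor._lock.release()
    return {"length": float(len(program))}


predictor = SnapshotPredictor(probing_extract)
predictor.fit({"length": 2.0})
prediction = predictor.predict("abc")
```

If a later change moved the extraction back inside the `with` block, the non-blocking acquire would fail and a False would appear in `lock_was_free`.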

EvolutionaryStatisticsCollector._process re-filtered the full population by iteration metadata for every program in the snapshot, repeating the O(N) scan N times. Bucket programs by iteration once in _ensure_population_cache (alongside the existing per-generation cache) then look up the iteration entry by key. Skipping programs whose iteration metadata is absent preserves the existing None-iteration fallback when the snapshot excludes metadata.

Micro-benchmark of the filter pattern (M iterations across N programs):
- N=200 M=5: 2.99 ms -> 0.03 ms (~114x)
- N=1000 M=10: 111.59 ms -> 0.20 ms (~570x)
- N=5000 M=50: 8410.70 ms -> 2.62 ms (~3205x)

tests/stages/test_collector.py: 29 passed.
tests/benchmarks/test_collector_scaling.py: 12 passed.
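The bucketing step can be sketched as follows, assuming for illustration that each program is a dict carrying an optional `iteration` key; the real collector works on program objects and their metadata.

```python
from collections import defaultdict
from typing import Any


def bucket_by_iteration(programs: list[dict[str, Any]]) -> dict[int, list[dict[str, Any]]]:
    """Group programs by iteration in a single pass over the population."""
    buckets: dict[int, list[dict[str, Any]]] = defaultdict(list)
    for program in programs:
        iteration = program.get("iteration")
        if iteration is None:
            continue  # programs without iteration metadata are skipped
        buckets[iteration].append(program)
    return dict(buckets)


def peers_naive(programs: list[dict[str, Any]], iteration: int) -> list[dict[str, Any]]:
    """The pattern being replaced: a full O(N) scan for every lookup."""
    return [p for p in programs if p.get("iteration") == iteration]
```

Building the buckets costs one O(N) pass; each later lookup is a dictionary access, so processing N programs no longer multiplies into N full scans.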

The stepwise tool-step path passed (ref, outer_context, step_outputs) to _resolve_reference but omitted the per-sample dict, so $sample.X references silently resolved to the empty string. Latent today because no enabled stepwise consumer depends on $sample.* yet, but it is a correctness landmine for future tool inputs that need sample fields. Add a regression test covering the stepwise dispatch path plus the existing reference-resolution branches.

Every public field on the typed config schemas gains a one-sentence description. The CLI's --help layer (tyro) reads these and renders them next to each flag, so end users see what every override does instead of just the default value. Covers algorithm, engine, experiment, llm, logging, migration_bus, pipeline, problem, prompt, redis, runner, and scheduling. Internal fields kept under a clear class-level docstring (the discriminated-union markers and structural list fields whose semantics are explained in the class header) are left alone.

_process_sample read client.call_logs[0], which both IndexErrors when no log was appended and silently drops every retry attempt beyond the first. The retry decorator on LLMClient.__call__ can push multiple entries (each successful API hit appends one) before the call that yielded the returned response, so the existing read understated the sample's budget consumption. Introduce a private aggregator that sums prompt_tokens, completion_tokens, cost, and cost_utilization across all per-attempt entries, and falls back to a zero CallLog on the empty-list branch. The fix is contained to utils.py and does not touch the fenced client.
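A sketch of the aggregation under an assumed log shape. The four summed fields are the ones named in the commit message; the `CallLog` dataclass below and its defaults are hypothetical, not the client's actual type.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class CallLog:
    """Hypothetical per-attempt usage record."""

    prompt_tokens: int = 0
    completion_tokens: int = 0
    cost: float = 0.0
    cost_utilization: float = 0.0


def aggregate_call_logs(logs: Iterable[CallLog]) -> CallLog:
    """Sum usage across every attempt; an empty input yields a zero CallLog."""
    total = CallLog()
    for log in logs:
        total.prompt_tokens += log.prompt_tokens
        total.completion_tokens += log.completion_tokens
        total.cost += log.cost
        total.cost_utilization += log.cost_utilization
    return total
```

Compared with reading index 0, this counts retried attempts toward the sample's budget and cannot raise IndexError when nothing was logged.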

remove_boxed previously used bare ``assert`` statements to enforce boxed-expression shape: ``\boxed{42xyz`` (trailing garbage) and ``\boxed{42`` (missing closing brace) both raised AssertionError, and under ``python -O`` the assertions are stripped — turning structural checks into silent fall-through that corrupts the returned slice. Replace the asserts with explicit ``return None`` guards, matching the existing "no boxed found" branch. Well-formed input keeps producing the same string; malformed input now folds into the standard extraction- failure path that callers already handle by counting None predictions. Applied to all three sibling copies (chains/aime, prompts/aime, prompts/gsm8k) and covered by a parametrized regression suite.
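The guard-based version can be sketched as below. The signature, including the `left` default, is an assumption made for a self-contained example; the three sibling copies in the repository may differ in detail.

```python
from typing import Optional


def remove_boxed(s: str, left: str = "\\boxed{") -> Optional[str]:
    """Return the content of a well-formed \\boxed{...} string, else None."""
    # Explicit checks survive `python -O`, unlike assert statements.
    if not s.startswith(left) or not s.endswith("}"):
        return None
    return s[len(left):-1]
```

With asserts, the same malformed input either crashed the caller's loop or, under `-O`, fell through to the slice and returned a string with the wrong prefix or a missing final character. Returning None routes both cases into the failure path callers already handle.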

Each experiment module now states the problem, the algorithm / pipeline / engine / LLM choice it showcases, and any unusual constraint in 2-4 lines so a user scanning the experiments/ directory can pick the right starting point without reading the body. ``runner_presets`` gains the same compose-into-experiment example the other ``*_presets`` modules already carry so the surface is uniform across the preset layer.

The TYPE_CHECKING guard contained only a 'pass' placeholder. Remove it along with the unused TYPE_CHECKING import.

Fix B007 in tools/throughput_plot.py and tools/wizard/__main__.py where the loop control variable is discarded inside the body.

PIE790: each exception class already has a docstring, which satisfies the suite's body requirement on its own.

Three chain validators return `(metrics, failures)` tuples but advertise `-> dict`. The runtime contract in `CallValidatorFunction.parse_output` already accepts both shapes, so behaviour is unchanged — this is a pure annotation/docstring repair so type-checkers and readers see the actual return type. Touched: chains/hover/static, chains/hotpotqa/static_ra, chains/hotpotqa/static_a.

The per-instruction loop wrote the None-stripped kwargs dict back into `input["kwargs"][index]`. Because `DataFrame.to_dict(orient="records")` shares the underlying list cells with the source frame, that write poisoned the dataset for any subsequent validate() call that reused the cached frame. Filter into a local dict instead; the dataset stays pristine across iterations.

…scorable `calculate_fitness` returned `None` when no rule had multi-class coverage. The selectors call `extract_fitness_values`, which negates `value` for minimization objectives — a `None` propagates as a `TypeError` on `-None`. Substitute `0.0` so degenerate batches surface as "no signal" rather than crashing the engine, and annotate the function with `-> float` to document the contract.

`tyro.cli(..., args=["--help"])` always raises `SystemExit(0)` via argparse, so the trailing `return 0` could never execute. Remove the dead line and document the exit semantics inline so future readers don't reintroduce the assumption that control falls through.

- redis/metrics._flatten_numbers: the ternary on key construction had identical 'then' and 'else' expressions; collapse to a single literal. - tools/lineage: tools/**/*.py already ignores E402 globally, so the per-import noqa: E402 directives are redundant.

RUF059: serve_until_signal discards the 'done' set returned by asyncio.wait, and trajectory only reads prev_v from the trailing improvement_points tuple. Prefix with underscore so the intent is visible at the unpack site.

redis-py stubs share signatures between the sync and async clients, so return types widen to Union[Awaitable[T], T]. The sync client always returns the concrete value, but pyright narrows on the Awaitable side and flags every lrange/hgetall/keys site. Add typing.cast narrowings where the call sites are; behaviour at runtime is unchanged.

The smoothed array is built across five branches; one path yields a pandas Series.values whose dtype the numpy stubs cannot align with the boolean-indexed __setitem__ signature. asarray pins the runtime type without changing the produced values.

…arnings

MagicMock spoofs isinstance checks by rebinding __class__; the type checker rejects the assignment, but the runtime pattern is documented behaviour. Annotate the two assignments with the standard misc ignore.
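The pattern in question looks roughly like the sketch below. `Writer` is a hypothetical class, and the `[misc]` error code is an assumption about what "the standard misc ignore" refers to.

```python
from unittest.mock import MagicMock


class Writer:
    """Hypothetical class that production code checks with isinstance()."""


def make_fake_writer() -> MagicMock:
    fake = MagicMock()
    # unittest.mock documents __class__ as assignable so a mock can pass
    # isinstance() checks; static checkers still reject the assignment.
    fake.__class__ = Writer  # type: ignore[misc]
    return fake
```

The ignore comment silences only the checker; the isinstance spoofing itself is unchanged.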

… literal

The constant was annotated Final[str], so preset builders passed list[str] into BehaviorSpaceConfig.binning_types whose declared list[BinningType] is invariant. Re-typing the constant against the schema literal lets the presets typecheck without runtime change. The import is aliased with an underscore prefix so the defaults namespace stays free of foreign symbols.
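A sketch of why the re-typing matters. The member values `"linear"` and `"log"`, the constant name `DEFAULT_BINNING`, and the helper `default_binning_types` are all hypothetical; only `BinningType` as a schema literal comes from the commit message.

```python
from typing import Final, Literal, get_args

# Hypothetical member values for the schema literal.
BinningType = Literal["linear", "log"]

# Before: `DEFAULT_BINNING: Final[str] = "linear"` made `[DEFAULT_BINNING]`
# a list[str], which is not assignable to the invariant list[BinningType].
DEFAULT_BINNING: Final[BinningType] = "linear"


def default_binning_types(dimensions: int) -> list[BinningType]:
    """Build one default binning entry per behaviour-space dimension."""
    return [DEFAULT_BINNING] * dimensions
```

The runtime value is the same string either way; only the static type seen by the checker changes.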

init_composite returns CompositeLogger, which is a sibling of GenericLogger under LogWriter rather than a subclass. The previous GenericLogger return annotation misrepresented the concrete return and broke type narrowing on every caller; the unit test already asserts isinstance(writer, CompositeLogger).
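The class relationship can be sketched as follows. The class and function names come from the commit message, but the bodies are hypothetical placeholders.

```python
class LogWriter:
    """Common base for log writers (placeholder body)."""


class GenericLogger(LogWriter):
    """One concrete writer (placeholder body)."""


class CompositeLogger(LogWriter):
    """Sibling of GenericLogger that fans out to several writers."""

    def __init__(self, writers: list[LogWriter]) -> None:
        self.writers = list(writers)


def init_composite(writers: list[LogWriter]) -> CompositeLogger:
    # Annotating this as GenericLogger would be wrong: the two classes are
    # siblings under LogWriter, so neither is a subtype of the other.
    return CompositeLogger(writers)
```

With the corrected annotation, callers can reach `writers` without a cast or an isinstance narrowing.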

…rite

KhrulkovV and others added 30 commits April 2, 2026 15:50

Merge pull request #152 from KhrulkovV/refactor/ideas-tracker-cleanup

6d9144f

refactor: ideas_tracker cleanup — loguru + sys.path removal

Merge pull request #153 from KhrulkovV/refactor/memory-pydantic

be0a430

refactor: dict → Pydantic — normalize_memory_card returns AnyCard

1.26.0

4460321

Automatically generated by python-semantic-release

fix: lint import sorting in A_mem + GAM_root (pre-existing)

d6b87b4

Merge pull request #154 from KhrulkovV/refactor/memory-public-api

dc565eb

refactor(memory): public API exports

Merge pull request #155 from KhrulkovV/refactor/memory-test-consolidation

29df034

refactor(memory): consolidate test files into tests/memory/

fix(memory): correct concept_to_card return type annotation

f4767a4

Was `-> dict[str, Any]` but actually returns `AnyCard` (Pydantic model). Found by chaos-hacker review — prevents TypeError trap for future callers.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Merge pull request #156 from KhrulkovV/refactor/memory-quality

e1def9b

refactor(memory): type quality improvements

fix: format card_conversion.py

9eaebd8

Merge pull request #157 from KhrulkovV/refactor/agentic-quality

4ffad6b

refactor: replace 50 print() with loguru in A_mem + GAM_root

1.27.0

2a86efe

Automatically generated by python-semantic-release

merge main into exp/hover-no-deep-retrieval (resolve plot conflict)

21fe62f

Merge pull request #150 from KhrulkovV/exp/hover-no-deep-retrieval

22a514f

exp: hover/no-deep-retrieval — ablation of retrieve_deep (k=10)

Merge pull request #160 from KhrulkovV/refactor/test-rename

b488828

refactor: rename memory test files — descriptive names

Merge pull request #162 from KhrulkovV/refactor/remove-type-ignores

4a040d6

refactor: remove all 27 type: ignore comments

style: ruff format

c221945

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

1.28.0

8d1354d

Automatically generated by python-semantic-release

GrigoryEvko added 28 commits May 19, 2026 04:10

chore(entrypoint): drop empty TYPE_CHECKING block in evolution_context

b3a646b

The TYPE_CHECKING guard contained only a 'pass' placeholder. Remove it along with the unused TYPE_CHECKING import.

chore(tools): prefix unused loop variables with underscore

9d1669d

Fix B007 in tools/throughput_plot.py and tools/wizard/__main__.py where the loop control variable is discarded inside the body.

chore(exceptions): drop redundant pass after docstring

fbf97ef

PIE790: each exception class already has a docstring, which satisfies the suite's body requirement on its own.

KhrulkovV force-pushed the main branch from 054df39 to 0f2b866 on May 26, 2026 09:37

chore: empty commit to refresh PR mergeability after main history rewrite

fc83833


refactor(config): replace hydra/omegaconf with typed pydantic+tyro #21
GrigoryEvko wants to merge 943 commits into FusionBrainLab:main from GrigoryEvko:feat/hydra-cutover


2 participants


TL;DR

What this delivers

Performance wins

Reliability — bugs and latent defects fixed

Security / credentials

Correctness

Latent bugs (would have bitten under specific conditions)

Why now

What changes for users

Entry point rename

YAML → Python experiment

Overrides

Sweeps

Schema surface

Test plan

Conflict map with open PRs

Out of scope / follow-ups
